Weather Model Methods and Results
1 Methods
1.1 Observations
Our observations are obtained from Castro (2023) and contain data from U.S. weather stations and SNOTEL stations over the period from 1 January 2017 to 21 September 2017. Columns of the observations include:
| Name | Data Type | Description |
|---|---|---|
| station | string | Name of the weather station |
| state | string | State/province of the weather station |
| latitude | double | Latitude of the weather station (degrees) |
| longitude | double | Longitude of the weather station |
| elevation | double | Elevation of the weather station (meters) |
| date | string | Date of the observation |
| temp_min | double | Minimum daily temperature (Fahrenheit) |
| temp_max | double | Maximum daily temperature (Fahrenheit) |
| temp_avg | double | Average daily temperature (Fahrenheit) |
| av_day_wi_spd | double | Average daily wind speed |
| wi_dir_5_sec | double | Wind direction at 5-second intervals (degrees) |
| wi_spd_5_sec | double | Wind speed at 5-second intervals |
| snow_fall | double | Snow water equivalence |
| snow_dep | double | Snow depth |
| precip | double | Precipitation |
Our interest is in creating a model from the available data using latitude, longitude, elevation and date as predictors of temp_avg. To prepare the data for modeling, we first load the data.

```r
weather <- read.csv("./data/weather.csv")
```
Next, we remove columns from the dataframe we are not interested in.

```r
library(tidyr)  # provides drop_na() and re-exports the %>% pipe
# Keep only our x and y columns from the data table
weather <- weather[c("date", "longitude", "latitude", "elevation", "temp_avg")]
# Drop observations with NA values for our x and y columns
weather <- weather %>% drop_na()
```

Since the date is not useful as a string for fitting a model, we replace it with the number of days since our first observation on 1 January 2017.
# Convert "date" column from string to date objects
weather$date <- as.Date(as.character(weather$date), format="%Y%m%d")
# Create new column counting the days since Jan 1 2017
weather$day <- as.numeric(difftime(weather$date, as.Date("20170101", format="%Y%m%d"), units="days"))
# Remove date column
weather <- weather[c("day", "longitude", "latitude", "elevation", "temp_avg")]Finally, we restrict our data to the lower 48 U.S. states and the District of Columbia. To accomplish this, we must:
- Convert our data.frame into a geospatial sf data frame object
- Load shapefiles for the area of interest
- Exclude observations spatially located outside of the shapefile
- Convert the sf object back into a data.frame
The process for this follows below.
```r
library(sf)
# Convert our data frame to sf
weather_sf <- st_as_sf(weather, coords = c("longitude", "latitude"), crs=st_crs("EPSG:4269"))
# Load U.S. States shapefile
states_sf <- read_sf("./data/cb_2018_us_state_500k/cb_2018_us_state_500k.shp", crs=st_crs("EPSG:4269"))
# Exclude specific states and territories
states_sf <- subset(states_sf, !(STATEFP %in% c("02","15","60","66","69","72","78")))
# Remove observations outside the remaining states and territories
weather_sf <- st_intersection(weather_sf, st_union(states_sf))
# Convert back into data frame
weather <- st_drop_geometry(weather_sf)
weather$longitude <- st_coordinates(weather_sf)[,"X"]
weather$latitude <- st_coordinates(weather_sf)[,"Y"]
```

We now have data to which we may properly fit our model.
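The effect of st_intersection can be seen on a toy example (a minimal sketch assuming the sf package is available; the square and points here are hypothetical, not the weather data):

```r
library(sf)
# A unit square polygon
square <- st_sfc(st_polygon(list(rbind(c(0,0), c(1,0), c(1,1), c(0,1), c(0,0)))))
# Two points: one inside the square, one outside
pts_sf <- st_sf(id = 1:2,
                geometry = st_sfc(st_point(c(0.5, 0.5)), st_point(c(2, 2))))
# Only the interior point survives the intersection
kept <- st_intersection(pts_sf, square)
```

The same mechanism drops weather stations falling outside the union of the state polygons.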
Before continuing, it is worth noticing the spatial distribution of the observations in Figure 1. Specifically, there is a much higher density of observations in the inter-mountain west, which has implications for our model. These will be discussed in more detail later.
We may plot the data in two dimensions at a time and see what kind of trends, if any, exist in the data. Figure 2 plots each of the predictors elevation, days, latitude and longitude against the outcome temp_avg.

Trends do appear to exist. Looking at Figure 2 (b), there is a clear negative linear correlation between elevation and temp_avg. Similarly, there is a clear negative linear correlation between latitude and temp_avg. The two remaining predictors, longitude and days, also appear to exhibit some trend, but it is not clear that these trends are linear.
Based on physical understanding of the weather, the correlations of both latitude and elevation are intuitive. First, atmospheric pressure decreases in proportion to elevation. In turn, Gay-Lussac’s law informs us that \[P \propto T.\] Therefore, we transitively expect temperature to decrease in proportion to elevation, all else being equal.
In addition, background knowledge about solar irradiance, the primary forcing effect on surface temperature, informs us that irradiance decreases in intensity with distance from the solar equator, due to the angle of incidence at which solar rays strike the Earth.
Ignoring the effect of the atmosphere on solar irradiance, geometric reasoning tells us that solar intensity \(I\) at latitude \(\theta\), where \(I_0\) is the intensity at the equator, is \[ I = I_0 \cos{\theta}.\] However, in our data, where latitude varies from approximately 25 to 50 degrees, this function \(I\) is roughly linear, as shown in Figure 3.
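This near-linearity can also be checked numerically (a quick sketch; the 25 to 50 degree range is taken from the data description):

```r
# Relative intensity I / I_0 = cos(theta) over our latitude range
theta <- seq(25, 50, by = 0.5) * pi / 180
intensity <- cos(theta)
# Straight line through the endpoints of the range
line <- approx(x = range(theta), y = cos(range(theta)), xout = theta)$y
# Maximum relative deviation of cos(theta) from that line
max_dev <- max(abs(intensity - line) / intensity)
max_dev  # under 3%, so a linear term in latitude is a reasonable approximation
```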
In addition, if we interpolate the data, we may view how it changes in space and over time, though the effect of elevation is not visualized. The inverse distance weighted interpolation is computationally intensive, so patience is required to generate the resulting visualization in Figure 4.
Figure 4: Interpolated temp_avg from each station for each day in the data set.
1.2 Procedure
Because this model is a combination of both linear and highly non-linear physical effects, we choose to fit the data set with a Generalized Additive Model (GAM), implemented by Hastie (2023). In this type of model, we explain our outcome variable \(\hat{y}\) as the sum of smooth functions of our predictors \(x_i\), i.e.
\[\hat{y} = \beta_0 + f_1(x_1) + f_2(x_2) + \cdots + f_n(x_n).\] For our particular predictors, we use a simple linear function \(f_i(x_i) = \beta_ix_i\) for elevation, latitude and longitude, and a nonlinear smooth function for days. Then, our particular model is
\[\text{temp\_avg} = \beta_0 + \beta_1 (\text{elevation}) + \beta_2 (\text{latitude}) + \beta_3 (\text{longitude}) + f(\text{days}),\] since we hypothesize the partial derivatives of temp_avg with respect to longitude, latitude and elevation to be roughly constant.
Before we perform regression, we first split the data in weather randomly into training and testing sets with a 7 to 3 split.
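Note that the TRUE/FALSE sampling below yields only an approximately 70/30 split, since each row is assigned independently. If an exact split is preferred, one alternative (a sketch on a hypothetical toy data frame, not the method used here) is to sample row indices without replacement:

```r
set.seed(1)
# Toy stand-in for the weather data frame
weather_toy <- data.frame(x = 1:100)
n <- nrow(weather_toy)
# Sample exactly 70% of the row indices for training
train_idx <- sample(n, size = floor(0.7 * n))
train_exact <- weather_toy[train_idx, , drop = FALSE]
test_exact  <- weather_toy[-train_idx, , drop = FALSE]
```

This guarantees the training set holds exactly 70% of the rows.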
```r
# Set a seed so the split is reproducible
set.seed(42)
# Randomly choose our indices to sample from for our training set
sample_frac <- 0.7
train_inds <- sample(c(TRUE, FALSE), nrow(weather), replace=TRUE,
                     prob=c(sample_frac, 1-sample_frac))
# Create training set from training set indices
train <- weather[train_inds,]
train_x <- train[, c('elevation','longitude','latitude','day')]
train_y <- train[, c('temp_avg')]
# Create testing set from negation of training set indices
test <- weather[!train_inds,]
test_x <- test[, c('elevation','longitude','latitude','day')]
test_y <- test[, c('temp_avg')]
```

Now we fit our model.
library("gam")
gam_weather <- gam(temp_avg ~ elevation + longitude + latitude + s(day,df=5), data=train)2 Results
The resulting coefficients \(\beta_i\) in our gam_weather model are
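For reference, the linear coefficients of a fitted gam object can be listed with coef(). The sketch below uses hypothetical toy data (with a true slope of 3 on x), since the fitted weather model is not reproduced here:

```r
library(gam)
set.seed(1)
# Toy data: a linear effect of x plus a smooth effect of day
toy <- data.frame(x = runif(200), day = runif(200, 0, 260))
toy$y <- 2 + 3 * toy$x + sin(toy$day / 40) + rnorm(200, sd = 0.1)
fit <- gam(y ~ x + s(day, df = 5), data = toy)
coef(fit)  # intercept, coefficient on x, and the linear part of s(day)
```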
Using our testing set, we calculate the root mean square deviation.
```r
# Use the model to make predictions on our testing set
gam_predict <- predict(gam_weather, test_x)
# Calculate RMSE from the residuals of our predictions and the testing set temp_avg
gam_rmse <- sqrt(mean((gam_predict - test_y)^2))
```

giving us a root mean square deviation of 7.8637781. Partial residual plots for each of the independent predictors are shown below in Figure 5.
In addition, we may also show the residuals per station.